Having access to clean and safe drinking water is vital for good health and is considered a fundamental human right. It is also an integral part of any successful health protection policy, and is crucial for promoting both national and local development. In certain areas, research has demonstrated that investing in water supply and sanitation can have a positive economic impact, as the benefits of reduced health problems and associated healthcare expenses can outweigh the costs of implementing these interventions.
Using this dataset, we will try to predict whether water is safe for drinking based on its chemical properties, such as pH level, hardness, chloramine content, and other relevant parameters. The features are:
pH value: Indicator of the acidic or alkaline condition of the water.
Hardness: Amount of calcium and magnesium salts.
Solids (Total dissolved solids - TDS): Amount of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates etc.
Chloramines: Amount of chlorine and chloramines in the water.
Sulfate: Sulfates are naturally occurring substances that are found in minerals, soil, and rocks.
Conductivity: Electrical conductivity (EC) measures the ionic content of a solution, which determines its ability to transmit current.
Organic_carbon: Total Organic Carbon (TOC) is a measure of the total amount of carbon in organic compounds in pure water.
Trihalomethanes: THMs are chemicals which may be found in water treated with chlorine.
Turbidity: The turbidity of water is used to indicate the quality of waste discharge with respect to colloidal matter.
Potability: Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.
### Basic Data Manipulation libraries
import pandas as pd
import numpy as np
### Data Visualization libraries
import seaborn as sns
import plotly
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
plotly.offline.init_notebook_mode()
### Modeling and data preprocessing libraries
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
from sklearn.metrics import classification_report,accuracy_score,f1_score,recall_score,roc_curve, roc_auc_score
from scipy.stats import chi2_contingency
### Model interpretability library
import shap
shap.initjs()
## Extras
import pickle as pk ## To save model / variables
from copy import deepcopy
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
import os
import warnings
warnings.filterwarnings("ignore")
print("All libraries successfully loaded")
All libraries successfully loaded
## Loading the dataset
df=pd.read_csv(r"Data/water_potability.csv")
df.head()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
df.tail()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 3271 | 4.668102 | 193.681735 | 47580.991603 | 7.166639 | 359.948574 | 526.424171 | 13.894419 | 66.687695 | 4.435821 | 1 |
| 3272 | 7.808856 | 193.553212 | 17329.802160 | 8.061362 | NaN | 392.449580 | 19.903225 | NaN | 2.798243 | 1 |
| 3273 | 9.419510 | 175.762646 | 33155.578218 | 7.350233 | NaN | 432.044783 | 11.039070 | 69.845400 | 3.298875 | 1 |
| 3274 | 5.126763 | 230.603758 | 11983.869376 | 6.303357 | NaN | 402.883113 | 11.168946 | 77.488213 | 4.708658 | 1 |
| 3275 | 7.874671 | 195.102299 | 17404.177061 | 7.509306 | NaN | 327.459760 | 16.140368 | 78.698446 | 2.309149 | 1 |
## Shape of the data
df.shape
(3276, 10)
## Checking column types
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   ph               2785 non-null   float64
 1   Hardness         3276 non-null   float64
 2   Solids           3276 non-null   float64
 3   Chloramines      3276 non-null   float64
 4   Sulfate          2495 non-null   float64
 5   Conductivity     3276 non-null   float64
 6   Organic_carbon   3276 non-null   float64
 7   Trihalomethanes  3114 non-null   float64
 8   Turbidity        3276 non-null   float64
 9   Potability       3276 non-null   int64
dtypes: float64(9), int64(1)
memory usage: 256.1 KB
# Describe
df.describe()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2785.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 2495.000000 | 3276.000000 | 3276.000000 | 3114.000000 | 3276.000000 | 3276.000000 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.284970 | 66.396293 | 3.966786 | 0.390110 |
| std | 1.594320 | 32.879761 | 8768.570828 | 1.583085 | 41.416840 | 80.824064 | 3.308162 | 16.175008 | 0.780382 | 0.487849 |
| min | 0.000000 | 47.432000 | 320.942611 | 0.352000 | 129.000000 | 181.483754 | 2.200000 | 0.738000 | 1.450000 | 0.000000 |
| 25% | 6.093092 | 176.850538 | 15666.690297 | 6.127421 | 307.699498 | 365.734414 | 12.065801 | 55.844536 | 3.439711 | 0.000000 |
| 50% | 7.036752 | 196.967627 | 20927.833607 | 7.130299 | 333.073546 | 421.884968 | 14.218338 | 66.622485 | 3.955028 | 0.000000 |
| 75% | 8.062066 | 216.667456 | 27332.762127 | 8.114887 | 359.950170 | 481.792304 | 16.557652 | 77.337473 | 4.500320 | 1.000000 |
| max | 14.000000 | 323.124000 | 61227.196008 | 13.127000 | 481.030642 | 753.342620 | 28.300000 | 124.000000 | 6.739000 | 1.000000 |
## Checking null values %
(df.isnull().sum()/df.shape[0]) * 100
ph                 14.987790
Hardness            0.000000
Solids              0.000000
Chloramines         0.000000
Sulfate            23.840049
Conductivity        0.000000
Organic_carbon      0.000000
Trihalomethanes     4.945055
Turbidity           0.000000
Potability          0.000000
dtype: float64
labels = ["Non Potable","potable water"]
values = list(df["Potability"].value_counts()[0:2])
# pull is given as a fraction of the pie radius
fig = go.Figure(data=[go.Pie(labels=labels,textinfo='label+percent', values=values, pull=[0, 0.2])])
fig.show()
Looking at the pie chart above, about 61% of the samples are non-potable and the remaining 39% are potable water.
# Impact of pH on potability; the recommended pH for drinking water is roughly 6.5 to 8.5
fig = px.histogram(df, x="ph", color="Potability", marginal="box", hover_data=df.columns)
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()
Looking at the plot above, the pH of both potable and non-potable water lies between 4 and 10, while the recommended range for drinking water is roughly 6.5 to 8.5.
# hard water is considered to be >= 120 ppm
fig = px.box(df, x="Hardness", color="Potability")
fig.update_layout(barmode='overlay')
fig.show()
Looking at the plot above, the hardness of both potable and non-potable water mostly lies between 100 and 300 ppm. Since water above 120 ppm is generally classified as hard, most of these samples qualify as hard, mineral-rich water.
# Desirable limit for TDS is 500 mg/l and maximum limit is 1000 mg/l which prescribed for drinking purpose
fig= px.violin(df, y="Solids", color="Potability", hover_data=df.columns,violinmode='overlay')
fig.show()
From the plot above, Solids (TDS) mostly lie between 0 and 50,000 ppm, far above the prescribed range of 500 to 1,000 mg/L. By this measure, the water in this dataset contains a high level of dissolved solids, which can be deemed harmful.
# Chlorine levels up to 4 milligrams per liter (mg/L or 4 parts per million (ppm)) are considered safe in drinking water.
fig = px.histogram(df, x="Chloramines", color="Potability", marginal="rug", hover_data=df.columns)
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()
As shown above, chloramine levels mostly lie between 4 and 10 mg/L, while the safe limit for drinking water is about 4 ppm. Again, this suggests the sampled water has slightly elevated chlorine levels.
# Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L).
# It ranges from 3 to 30 mg/L in most freshwater supplies,
# although much higher concentrations (1000 mg/L) are found in some geographic locations.
fig = px.histogram(df, x="Sulfate", color="Potability")
fig.show()
The sulfate concentrations also lean toward the higher side. Some geographic locations have sulfate between 500 and 1,000 ppm; compared with typical freshwater supplies (3 to 30 mg/L of sulfate), that is a large difference.
# According to WHO standards, EC value should not exceeded 400 μS/cm.
# 30-800 is tap water
fig = px.violin(df, y="Conductivity", color="Potability", hover_data=df.columns,points='all', box=True)
fig.show()
From the conductivity plot above, many of the water samples in this dataset have an EC above the WHO limit of 400 μS/cm.
# Typical TOC values in drinking water may range up to 25 ppm.
hist_data = [df["Organic_carbon"]]
group_labels = ['Organic Carbon'] # name of the dataset
fig = ff.create_distplot(hist_data, group_labels)
fig.show()
From the distribution of Organic Carbon in these water bodies, very few samples exceed 25 ppm TOC, the level generally considered unfit for drinking.
# THM levels up to 80 ppm is considered safe in drinking water
px.violin(df, y="Trihalomethanes", color="Potability", box=True)
Trihalomethanes in these water bodies generally lie between 20 and 100 ppm; water with THM levels above 80 ppm is generally considered unsafe for drinking.
# Turbidity for safe drinking water shouldn't be more than 5 NTU
px.box(df, y="Turbidity", color="Potability", points="all")
Turbidity in safe drinking water should not exceed 5 NTU. Based on the plot above, roughly 10% of the samples in this dataset are slightly above this limit.
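The "about 10%" reading can be checked numerically rather than eyeballed from the plot; a minimal sketch (the series below is synthetic — on the real data the same line would be applied to `df["Turbidity"]`):

```python
import pandas as pd

# Synthetic stand-in for df["Turbidity"]; on the real frame this would be:
# (df["Turbidity"] > 5).mean() * 100
turbidity = pd.Series([3.9, 4.7, 5.2, 2.3, 6.1, 4.0, 3.4, 4.5, 5.8, 3.1])
pct_above_5 = (turbidity > 5).mean() * 100
print(f"{pct_above_5:.0f}% of samples exceed 5 NTU")
```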
## Missing data plot
sns.heatmap(df.isnull(), cbar=False)
<AxesSubplot:>
corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.subplots(figsize=(15, 8))
sns.heatmap(corr, annot=True, linewidths=0.2, mask=mask)
plt.show()
From the heatmap above, very few chemical features correlate strongly with Potability. The strongest, in comparison to the others, is Sulfate, which is weakly negatively correlated with Potability.
feat_imp_df = df.copy(deep=True)
feat_imp_df.dropna(inplace=True)
feat_imp_df_x = feat_imp_df.drop('Potability',axis=1)
feat_imp_df_y = feat_imp_df["Potability"]
from sklearn.feature_selection import chi2
chi_scores = chi2(feat_imp_df_x,feat_imp_df_y)
chi_scores
(array([1.48242362e-01, 2.47437212e-02, 1.13317371e+04, 3.05663819e-01,
2.39841312e+00, 7.37150529e+00, 3.75054979e-01, 6.68603248e-01,
1.58630863e-01]),
array([0.70022071, 0.87500734, 0. , 0.58035331, 0.1214584 ,
0.00662654, 0.54026168, 0.413539 , 0.69042019]))
p_values = pd.Series(chi_scores[1],index = feat_imp_df_x.columns)
p_values.sort_values(ascending = False , inplace = True)
p_values.plot.bar()
<AxesSubplot:>
p_values < 0.05
Hardness           False
ph                 False
Turbidity          False
Chloramines        False
Organic_carbon     False
Trihalomethanes    False
Sulfate            False
Conductivity        True
Solids              True
dtype: bool
From the Chi-squared test above, comparing the p-values of each chemical property, Solids and Conductivity (EC) have p-values below 0.05 and are therefore the most statistically significant features.
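`chi2_contingency` was imported earlier but never used; it offers a complementary test of independence on a binned contingency table (feature bins vs. Potability). A hedged sketch with a hypothetical table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = binned feature levels,
# columns = Potability 0/1 counts (illustrative values only)
table = np.array([[120, 60],
                  [200, 180],
                  [80, 90]])
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2_stat:.2f}, p={p_value:.4f}, dof={dof}")
```

On the real data, the rows would come from something like `pd.crosstab(pd.qcut(df["ph"], 3), df["Potability"])`.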
# Random Forest importances
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=500,
random_state=1)
forest.fit(feat_imp_df_x, feat_imp_df_y.values)
RandomForestClassifier(n_estimators=500, random_state=1)
forest.feature_importances_
array([0.14447052, 0.1163697 , 0.10996037, 0.11774813, 0.14751926,
0.09197327, 0.09214582, 0.09226456, 0.08754838])
# Creating importances_df dataframe
importances_df = pd.DataFrame({"feature_names" : forest.feature_names_in_,
"importances" : forest.feature_importances_})
# Plotting bar chart, g is from graph
g = sns.barplot(x=importances_df["feature_names"],
y=importances_df["importances"], order=importances_df.sort_values("importances",ascending = False).feature_names)
plt.xticks(range(len(importances_df)), rotation='vertical')
g.set_title("Feature importances", fontsize=14);
# RFE
from sklearn.feature_selection import RFE
# Init the transformer
rfe = RFE(estimator= RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=4)
# Fit to the training data
_ = rfe.fit(feat_imp_df_x, feat_imp_df_y)
# took 50 seconds
print("The 4 selected features are =>", feat_imp_df_x.loc[:, rfe.support_].columns)
The 4 selected features are => Index(['ph', 'Hardness', 'Chloramines', 'Sulfate'], dtype='object')
From the feature-selection methods above, we can conclude that the chemical features with the highest importance are ph, Hardness, Chloramines, and Sulfate.
x= df.drop(columns=["Potability"])
y=df["Potability"]
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size= 0.20, random_state = 101)# training & testing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
df_mean_train = pd.DataFrame(scaler.fit_transform(x_train))
df_mean_test = pd.DataFrame(scaler.transform(x_test))
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(df_mean_train)
SimpleImputer()
x_imtrain = imp_mean.transform(df_mean_train)
x_imtest = imp_mean.transform(df_mean_test)
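A note on hygiene: chaining the scaler and imputer by hand works, but wrapping the same two steps in a scikit-learn `Pipeline` keeps the fit/transform discipline in one object and prevents accidentally fitting on test data. A sketch with tiny synthetic stand-ins for `x_train`/`x_test`:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

# Same preprocessing choices as above, bundled so fit() only ever sees train data
preprocess = Pipeline([
    ("scale", MinMaxScaler(feature_range=(0, 1))),
    ("impute", SimpleImputer(missing_values=np.nan, strategy="mean")),
])

# Tiny synthetic stand-ins for x_train / x_test
X_train = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 6.0]])
X_test = np.array([[2.0, np.nan]])

Xtr = preprocess.fit_transform(X_train)  # statistics learned on train only
Xte = preprocess.transform(X_test)       # train statistics reused on test
print(Xte)
```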
def evaluate_model(model, x_train, y_train, x_test, y_test, fit=False):
    '''
    Model evaluation for a classifier.
    :param model: model object
    :param x_train: train features
    :param y_train: train target
    :param x_test: test features
    :param y_test: test target
    :param fit: bool, True if the model is already fitted, else False
    :return: train and test classification reports and the ROC curve with AUC
    '''
    if not fit:
        model.fit(x_train, y_train)
    train_pred = model.predict(x_train)
    print("Training report")
    print(classification_report(y_train, train_pred))
    print("Testing report")
    test_pred = model.predict(x_test)
    print(classification_report(y_test, test_pred))
    y_pred_prob = model.predict_proba(x_test)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob[:, 1])
    auc = roc_auc_score(y_test, y_pred_prob[:, 1])
    print("AUC Score")
    print(auc)
    plt.plot(fpr, tpr, color='orange', label='ROC curve (area = %0.2f)' % auc)
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--', label='Random baseline')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
log_reg = LogisticRegression()
evaluate_model(log_reg,x_imtrain,y_train,x_imtest,y_test,fit=False)
Training report
precision recall f1-score support
0 0.61 1.00 0.76 1596
1 1.00 0.00 0.00 1024
accuracy 0.61 2620
macro avg 0.80 0.50 0.38 2620
weighted avg 0.76 0.61 0.46 2620
Testing report
precision recall f1-score support
0 0.61 1.00 0.76 402
1 0.00 0.00 0.00 254
accuracy 0.61 656
macro avg 0.31 0.50 0.38 656
weighted avg 0.38 0.61 0.47 656
AUC Score
0.5144454890899831
# Imputing with KNNImputer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
# Define scaler to set values between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
df_knn_train = pd.DataFrame(scaler.fit_transform(x_train))
df_knn_test = pd.DataFrame(scaler.transform(x_test))
# Define KNN imputer and fill missing values
knn_imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
df_knn_imputed_train = pd.DataFrame(knn_imputer.fit_transform(df_knn_train))
df_knn_imputed_test = pd.DataFrame(knn_imputer.transform(df_knn_test))
log_reg = LogisticRegression()
evaluate_model(log_reg,df_knn_imputed_train,y_train,df_knn_imputed_test,y_test,fit=False)
Training report
precision recall f1-score support
0 0.61 1.00 0.76 1596
1 1.00 0.00 0.01 1024
accuracy 0.61 2620
macro avg 0.80 0.50 0.38 2620
weighted avg 0.76 0.61 0.46 2620
Testing report
precision recall f1-score support
0 0.61 1.00 0.76 402
1 0.00 0.00 0.00 254
accuracy 0.61 656
macro avg 0.31 0.50 0.38 656
weighted avg 0.38 0.61 0.47 656
AUC Score
0.5136424178320993
From the two imputation techniques above, we observe that the AUC with mean imputation (0.514) is essentially the same as with KNN imputation (0.514). Hence we could use either; moving forward, for simplicity, we will use mean imputation.
We first used Logistic Regression as a base model and observed that it is not able to separate the classes. This suggests we need more advanced modeling techniques: logistic regression works well on linearly separable data but not on non-linear data.
Let's try an SVM, which can separate such data better.
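Part of the zero recall for class 1 is the class imbalance rather than non-linearity alone; `class_weight="balanced"` reweights the logistic loss without resampling. A sketch on synthetic data (the arrays below are hypothetical stand-ins for this dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Synthetic imbalanced, weakly separable data (hypothetical stand-in)
X = np.vstack([rng.normal(0.0, 1.0, (300, 2)), rng.normal(0.5, 1.0, (100, 2))])
y = np.array([0] * 300 + [1] * 100)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)
print("plain predicts class 1:   ", int((plain.predict(X) == 1).sum()))
print("balanced predicts class 1:", int((balanced.predict(X) == 1).sum()))
```

The reweighted model trades some accuracy for far more class-1 predictions, i.e. better minority recall.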
df.isna().sum()
ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
# SVM Classifier
from sklearn.svm import SVC
svm_model=SVC(probability=True)
evaluate_model(svm_model,x_imtrain,y_train,x_imtest,y_test,fit=False)
Training report
precision recall f1-score support
0 0.69 0.96 0.80 1596
1 0.84 0.32 0.46 1024
accuracy 0.71 2620
macro avg 0.77 0.64 0.63 2620
weighted avg 0.75 0.71 0.67 2620
Testing report
precision recall f1-score support
0 0.67 0.95 0.79 402
1 0.77 0.27 0.40 254
accuracy 0.69 656
macro avg 0.72 0.61 0.59 656
weighted avg 0.71 0.69 0.64 656
AUC Score
0.7091510949191051
SVM performed well compared to Logistic Regression, but recall for the potable class is still low.
Next we will try tree-based models such as Random Forest and boosted trees, since tree-based models can separate non-linear data better.
from sklearn.ensemble import RandomForestClassifier
rfc_model = RandomForestClassifier(n_estimators=100,
max_depth=2)
evaluate_model(rfc_model,x_imtrain,y_train,x_imtest,y_test,fit=False)
Training report
precision recall f1-score support
0 0.62 1.00 0.76 1596
1 0.88 0.05 0.10 1024
accuracy 0.63 2620
macro avg 0.75 0.52 0.43 2620
weighted avg 0.72 0.63 0.50 2620
Testing report
precision recall f1-score support
0 0.62 1.00 0.77 402
1 0.85 0.04 0.08 254
accuracy 0.63 656
macro avg 0.73 0.52 0.42 656
weighted avg 0.71 0.63 0.50 656
AUC Score
0.5951688408351942
Random Forest didn't perform well here; with max_depth=2 it underfits, giving a low AUC (about 0.60).
xgb = XGBClassifier(objective='binary:logistic')
xgb.fit(x_train, y_train)
evaluate_model(xgb,x_train,y_train,x_test,y_test,fit=True)
param_grid={
'learning_rate':[1,0.5,0.1,0.01,0.001],
'max_depth': [3,5,10,20],
'n_estimators':[10,50,100,200]
}
grid_xgb= RandomizedSearchCV(XGBClassifier(objective='binary:logistic'),param_grid, verbose=3,scoring = "roc_auc_ovr")
grid_xgb.fit(x_train,y_train)
print(grid_xgb.best_params_)
Training report
precision recall f1-score support
0 0.99 1.00 1.00 1596
1 1.00 0.99 1.00 1024
accuracy 1.00 2620
macro avg 1.00 1.00 1.00 2620
weighted avg 1.00 1.00 1.00 2620
Testing report
precision recall f1-score support
0 0.67 0.75 0.71 402
1 0.51 0.41 0.46 254
accuracy 0.62 656
macro avg 0.59 0.58 0.58 656
weighted avg 0.61 0.62 0.61 656
AUC Score
0.6567555921181494
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END learning_rate=0.1, max_depth=5, n_estimators=50;, score=0.645 total time= 0.0s
[CV 2/5] END learning_rate=0.1, max_depth=5, n_estimators=50;, score=0.652 total time= 0.0s
[CV 3/5] END learning_rate=0.1, max_depth=5, n_estimators=50;, score=0.656 total time= 0.0s
[CV 4/5] END learning_rate=0.1, max_depth=5, n_estimators=50;, score=0.628 total time= 0.0s
[CV 5/5] END learning_rate=0.1, max_depth=5, n_estimators=50;, score=0.653 total time= 0.0s
[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.590 total time= 0.1s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.652 total time= 0.1s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.602 total time= 0.1s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.644 total time= 0.0s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.607 total time= 0.0s
[CV 1/5] END learning_rate=1, max_depth=20, n_estimators=50;, score=0.617 total time= 0.1s
[CV 2/5] END learning_rate=1, max_depth=20, n_estimators=50;, score=0.638 total time= 0.1s
[CV 3/5] END learning_rate=1, max_depth=20, n_estimators=50;, score=0.618 total time= 0.1s
[CV 4/5] END learning_rate=1, max_depth=20, n_estimators=50;, score=0.626 total time= 0.1s
[CV 5/5] END learning_rate=1, max_depth=20, n_estimators=50;, score=0.637 total time= 0.1s
[CV 1/5] END learning_rate=0.5, max_depth=20, n_estimators=50;, score=0.627 total time= 0.1s
[CV 2/5] END learning_rate=0.5, max_depth=20, n_estimators=50;, score=0.680 total time= 0.1s
[CV 3/5] END learning_rate=0.5, max_depth=20, n_estimators=50;, score=0.646 total time= 0.1s
[CV 4/5] END learning_rate=0.5, max_depth=20, n_estimators=50;, score=0.633 total time= 0.1s
[CV 5/5] END learning_rate=0.5, max_depth=20, n_estimators=50;, score=0.669 total time= 0.1s
[CV 1/5] END learning_rate=0.001, max_depth=3, n_estimators=200;, score=0.580 total time= 0.2s
[CV 2/5] END learning_rate=0.001, max_depth=3, n_estimators=200;, score=0.595 total time= 0.1s
[CV 3/5] END learning_rate=0.001, max_depth=3, n_estimators=200;, score=0.548 total time= 0.1s
[CV 4/5] END learning_rate=0.001, max_depth=3, n_estimators=200;, score=0.582 total time= 0.1s
[CV 5/5] END learning_rate=0.001, max_depth=3, n_estimators=200;, score=0.554 total time= 0.2s
[CV 1/5] END learning_rate=0.5, max_depth=3, n_estimators=200;, score=0.614 total time= 0.2s
[CV 2/5] END learning_rate=0.5, max_depth=3, n_estimators=200;, score=0.660 total time= 0.2s
[CV 3/5] END learning_rate=0.5, max_depth=3, n_estimators=200;, score=0.602 total time= 0.2s
[CV 4/5] END learning_rate=0.5, max_depth=3, n_estimators=200;, score=0.639 total time= 0.2s
[CV 5/5] END learning_rate=0.5, max_depth=3, n_estimators=200;, score=0.614 total time= 0.2s
[CV 1/5] END learning_rate=1, max_depth=20, n_estimators=10;, score=0.621 total time= 0.0s
[CV 2/5] END learning_rate=1, max_depth=20, n_estimators=10;, score=0.632 total time= 0.0s
[CV 3/5] END learning_rate=1, max_depth=20, n_estimators=10;, score=0.608 total time= 0.0s
[CV 4/5] END learning_rate=1, max_depth=20, n_estimators=10;, score=0.628 total time= 0.0s
[CV 5/5] END learning_rate=1, max_depth=20, n_estimators=10;, score=0.635 total time= 0.0s
[CV 1/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.567 total time= 0.0s
[CV 2/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.529 total time= 0.0s
[CV 3/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.553 total time= 0.0s
[CV 4/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.549 total time= 0.0s
[CV 5/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.568 total time= 0.0s
[CV 1/5] END learning_rate=0.1, max_depth=20, n_estimators=100;, score=0.639 total time= 0.4s
[CV 2/5] END learning_rate=0.1, max_depth=20, n_estimators=100;, score=0.681 total time= 0.5s
[CV 3/5] END learning_rate=0.1, max_depth=20, n_estimators=100;, score=0.650 total time= 0.4s
[CV 4/5] END learning_rate=0.1, max_depth=20, n_estimators=100;, score=0.650 total time= 0.4s
[CV 5/5] END learning_rate=0.1, max_depth=20, n_estimators=100;, score=0.674 total time= 0.4s
[CV 1/5] END learning_rate=0.01, max_depth=10, n_estimators=50;, score=0.652 total time= 0.1s
[CV 2/5] END learning_rate=0.01, max_depth=10, n_estimators=50;, score=0.665 total time= 0.1s
[CV 3/5] END learning_rate=0.01, max_depth=10, n_estimators=50;, score=0.648 total time= 0.1s
[CV 4/5] END learning_rate=0.01, max_depth=10, n_estimators=50;, score=0.661 total time= 0.1s
[CV 5/5] END learning_rate=0.01, max_depth=10, n_estimators=50;, score=0.645 total time= 0.1s
{'n_estimators': 100, 'max_depth': 20, 'learning_rate': 0.1}
xgb_tuned = XGBClassifier(objective='binary:logistic', learning_rate=0.01, max_depth=5, n_estimators=50)  # deliberately more conservative than best_params_ to limit overfitting
xgb_tuned.fit(x_train, y_train)
evaluate_model(xgb_tuned,x_train,y_train,x_test,y_test,fit=True)
Training report
precision recall f1-score support
0 0.67 0.98 0.80 1596
1 0.91 0.24 0.38 1024
accuracy 0.69 2620
macro avg 0.79 0.61 0.59 2620
weighted avg 0.76 0.69 0.64 2620
Testing report
precision recall f1-score support
0 0.65 0.95 0.77 402
1 0.72 0.19 0.30 254
accuracy 0.66 656
macro avg 0.68 0.57 0.54 656
weighted avg 0.68 0.66 0.59 656
AUC Score
0.663410310651467
XGBoost performed better than Random Forest, with a test AUC of about 0.66.
catboost = CatBoostClassifier(silent =True)
catboost.fit(x_train,y_train)
evaluate_model(catboost,x_train,y_train,x_test,y_test,fit=True)
Training report
precision recall f1-score support
0 0.86 0.99 0.92 1596
1 0.97 0.74 0.84 1024
accuracy 0.89 2620
macro avg 0.91 0.86 0.88 2620
weighted avg 0.90 0.89 0.89 2620
Testing report
precision recall f1-score support
0 0.69 0.88 0.77 402
1 0.66 0.37 0.48 254
accuracy 0.68 656
macro avg 0.68 0.63 0.63 656
weighted avg 0.68 0.68 0.66 656
AUC Score
0.7154189681513692
cbc = CatBoostClassifier(silent =True)
#create the grid
grid = {'max_depth': [3,4,5,6,7,8,9],'n_estimators':[100, 200, 300]}
#instantiate GridSearchCV
gscv = GridSearchCV (estimator = cbc, param_grid = grid, scoring = "roc_auc_ovr"
, cv = 5)
#fit the model using grid search
gscv.fit(x_train,y_train)
#returns the estimator with the best performance
print(gscv.best_estimator_)
#returns the best score
print(gscv.best_score_)
#returns the best parameters
print(gscv.best_params_)
<catboost.core.CatBoostClassifier object at 0x000002178AEF9E50>
0.6608004044333156
{'max_depth': 7, 'n_estimators': 300}
catboost_tuned = CatBoostClassifier(max_depth = 7,n_estimators=300,silent =True)
catboost_tuned.fit(x_train,y_train)
evaluate_model(catboost_tuned,x_train,y_train,x_test,y_test,fit=True)
Training report
precision recall f1-score support
0 0.90 0.99 0.94 1596
1 0.98 0.82 0.90 1024
accuracy 0.93 2620
macro avg 0.94 0.91 0.92 2620
weighted avg 0.93 0.93 0.92 2620
Testing report
precision recall f1-score support
0 0.70 0.87 0.77 402
1 0.66 0.39 0.49 254
accuracy 0.69 656
macro avg 0.68 0.63 0.63 656
weighted avg 0.68 0.69 0.67 656
AUC Score
0.6997199044149331
CatBoost performed best so far, with a test AUC of about 0.70.
lgbm = lgb.LGBMClassifier()
lgbm.fit(x_train, y_train)
evaluate_model(lgbm,x_train,y_train,x_test,y_test,fit=True)
Training report
precision recall f1-score support
0 0.95 0.99 0.97 1596
1 0.99 0.92 0.96 1024
accuracy 0.97 2620
macro avg 0.97 0.96 0.96 2620
weighted avg 0.97 0.97 0.97 2620
Testing report
precision recall f1-score support
0 0.68 0.79 0.73 402
1 0.56 0.41 0.47 254
accuracy 0.64 656
macro avg 0.62 0.60 0.60 656
weighted avg 0.63 0.64 0.63 656
AUC Score
0.6663630665569789
param_test ={'num_leaves': sp_randint(6, 50),
"n_estimators" : [50,100,200,300]}
lgb_clf = lgb.LGBMClassifier(max_depth=7, random_state=314, silent=True, metric='None', n_jobs=4)
lgb_rs = RandomizedSearchCV(
estimator=lgb_clf, param_distributions=param_test,
n_iter=100,
scoring='roc_auc',
cv=3,
refit=True,
random_state=314,
verbose=True)
lgb_rs.fit(x_train,y_train)
print('Best score reached: {} with params: {} '.format(lgb_rs.best_score_, lgb_rs.best_params_))
Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best score reached: 0.6526526546855381 with params: {'n_estimators': 50, 'num_leaves': 44}
tuned_lgb = lgb.LGBMClassifier(n_estimators=50, num_leaves=44)
fit_params = {"early_stopping_rounds": 30,
              "eval_metric": 'auc',
              "eval_set": [(x_test, y_test)],
              'eval_names': ['valid'],
              'verbose': 100}
tuned_lgb.fit(x_train, y_train, **fit_params)
evaluate_model(tuned_lgb, x_train, y_train, x_test, y_test, fit=True)
Training report
precision recall f1-score support
0 0.82 0.97 0.89 1596
1 0.94 0.66 0.78 1024
accuracy 0.85 2620
macro avg 0.88 0.82 0.83 2620
weighted avg 0.87 0.85 0.85 2620
Testing report
precision recall f1-score support
0 0.67 0.84 0.75 402
1 0.58 0.35 0.44 254
accuracy 0.65 656
macro avg 0.63 0.60 0.59 656
weighted avg 0.64 0.65 0.63 656
AUC Score
0.6780565675559211
After evaluating all the models, we observed that CatBoost performed best, giving a test AUC of 0.70.
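`pickle` was imported at the top (as `pk`) for saving the model but never used; persisting the winning CatBoost model would look like the commented call below. The runnable part uses a stand-in object so the snippet does not depend on the trained model:

```python
import os
import pickle as pk
import tempfile

# With the real model this would be:
#     with open("catboost_tuned.pkl", "wb") as f:
#         pk.dump(catboost_tuned, f)
# Demonstrated here with a stand-in object (hypothetical values):
model_stub = {"name": "catboost_tuned", "max_depth": 7, "n_estimators": 300}
path = os.path.join(tempfile.gettempdir(), "catboost_tuned.pkl")
with open(path, "wb") as f:
    pk.dump(model_stub, f)
with open(path, "rb") as f:
    restored = pk.load(f)
print(restored == model_stub)  # True
```

CatBoost also offers its own `save_model`/`load_model`, which is generally safer across library versions than pickling.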
explainer = shap.TreeExplainer(catboost_tuned)
shap_values = explainer.shap_values(x_test)
shap.summary_plot(shap_values, x_test)
The graph illustrates how the features affect the model's output. It indicates that changing the pH scale alone does not make the water potable, suggesting that there is an optimum pH range for safe drinking water. Additionally, the graph suggests that the sulfate level should be kept low for drinkable water.
x_test.iloc[0,:]
ph                     5.735724
Hardness             158.318741
Solids             25363.016594
Chloramines            7.728601
Sulfate              377.543291
Conductivity         568.304671
Organic_carbon        13.626624
Trihalomethanes       75.952337
Turbidity              4.732954
Name: 2541, dtype: float64
We will now look at how the model predicts for the above feature values. It predicted the water as not potable; we will interpret this prediction using SHAP.
shap.force_plot(explainer.expected_value, shap_values[0,:], x_test.iloc[0,:])
By leveraging SHAP, we can infer that features such as ph, Hardness, Organic_carbon, and Sulfate push this prediction toward potable. However, Solids and Chloramines fall outside the acceptable range for safe drinking water and push the other way. As a result, the model's overall output indicates that the water is not potable.
def efficient_cutoff(actual_value, predicted):
    '''
    Model probability threshold cutoff plot.
    :param actual_value: actual target values
    :param predicted: predicted probabilities from the model
    :return: lists of probability cutoffs with their accuracy and recall scores
    '''
    probability_cutoff = []
    accuracy_score_val = []
    recall_score_val = []
    for i in range(30, 50, 2):  # Trying different probability threshold values
        predicted_x = deepcopy(predicted)
        predicted_x[predicted_x >= i / 100] = 1  # Classify as class 1 when >= threshold
        predicted_x[predicted_x < i / 100] = 0   # Classify as class 0 when < threshold
        probability_cutoff.append(i / 100)
        accuracy_score_val.append(accuracy_score(actual_value, predicted_x))  # Accuracy score
        recall_score_val.append(recall_score(actual_value, predicted_x))  # Recall score
    return (probability_cutoff, accuracy_score_val, recall_score_val)
pred= catboost_tuned.predict_proba(x_test)
probability_cutoff,accuracy_score_val,recall_score_val=efficient_cutoff(y_test,pred[:,1])
fig = px.scatter(x=accuracy_score_val, y=recall_score_val, text=probability_cutoff,
                 title='Threshold cutoff plot',
                 labels={"y": "Recall", "x": "Accuracy"})
fig.show()
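The plot leaves the final choice of cutoff to the eye; one explicit (and by no means the only) rule is to pick the cutoff maximizing the sum of accuracy and recall. A sketch with hypothetical stand-in values for the three returned lists:

```python
import numpy as np

# Hypothetical outputs of efficient_cutoff (stand-in values for illustration)
probability_cutoff = [0.30, 0.32, 0.34, 0.36, 0.38, 0.40]
accuracy_score_val = [0.52, 0.58, 0.63, 0.66, 0.68, 0.69]
recall_score_val = [0.85, 0.80, 0.74, 0.66, 0.55, 0.45]

# One possible criterion: maximize accuracy + recall jointly
combined = np.array(accuracy_score_val) + np.array(recall_score_val)
best_cutoff = probability_cutoff[int(np.argmax(combined))]
print("chosen cutoff:", best_cutoff)
```

For potability, where missing unsafe water is costlier than flagging safe water, a recall-weighted criterion would arguably fit better.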